In this Data Exploration assignment, you have two separate data sets with which you will work. The first involves the data generated by you and your classmates last week when you took the in-class survey. The second involves some of the data used in the Atkinson et al. (2009) piece that you read for class this week. Both data sets are described in more detail below.

If you have a question about any part of this assignment, please ask! Note that the actionable part of each question is bolded.

Part 1: Cognitive Biases

You may have noticed that the questions on the survey you took during class last week were based on the Kahneman (2003) reading you did for this week. The goal for this set of questions is to examine those data to see if you and your classmates exhibit the same cognitive biases that Kahneman wrote about. The data you generated is described below.

Data Details:

Variable Name Variable Description
Unique ID for each respondent
From the rare disease problem, the program chosen by the respondent (either ‘Program A’ or ‘Program B’)
From the rare disease problem, the framing condition to which the respondent was assigned (either ‘save’ or ‘die’)
From the Linda problem, the option the respondent thought most probable, either “teller” or “teller and feminist”
From the cab problem, the respondent’s estimate of the probability the car was blue
One of “man”, “woman”, “non-binary”, or “other”
Year at Harvard
Indicator for whether or not the respondent has taken a college-level statistics course

Before you get started, make sure you replace “file_name_here_1.csv” with the name of the file. (Also, remember to make sure you have saved the .Rmd version of this file and the file with the data in the same folder.)

# load the class-generated bias data
bias_data <- read_csv("bias_data.csv")

Question 1

First, let’s look at the rare disease problem. You’ll recall from the Kahneman (2003) piece that responses to this problem often differ based on the framing (people being saved versus people dying), despite the fact that the two frames are logically equivalent. This is what is called a ‘framing bias’.

Did you all exhibit this bias? Since the outcomes for this problem are binary, we need to test to see if the proportions who chose Program A under each of the conditions are the same. Report the difference in proportions who chose Program A under the ‘save’ and ‘die’ conditions. Do we see the same pattern that Kahneman described?

##       die      save 
## 0.3488372 0.6666667

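The proportion who chose Program A under the ‘die’ condition is 0.3488, and the proportion who chose Program A under the ‘save’ condition is 0.6667, a difference of about 0.318 (save minus die). This is the same pattern Kahneman described: respondents favored Program A more often when the outcomes were framed as lives saved.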
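One way to obtain proportions like those above is to compute the share choosing Program A within each framing condition. The sketch below uses a small made-up data frame (`toy_bias`, not the class data) and assumes the column names `rare_disease_prog` and `rare_disease_cond` that appear in the `prop_test()` call in this assignment.

```r
# Toy data for illustration only; these are NOT the class survey responses.
toy_bias <- data.frame(
  rare_disease_cond = c("save", "save", "save", "save", "die", "die", "die", "die"),
  rare_disease_prog = c("Program A", "Program A", "Program A", "Program B",
                        "Program A", "Program B", "Program B", "Program B")
)

# Proportion choosing Program A within each framing condition
chose_a <- toy_bias$rare_disease_prog == "Program A"
p_save  <- mean(chose_a[toy_bias$rare_disease_cond == "save"])
p_die   <- mean(chose_a[toy_bias$rare_disease_cond == "die"])

# Difference in proportions: 'save' minus 'die'
diff_a <- p_save - p_die
c(save = p_save, die = p_die, difference = diff_a)
```

With the real data, the same three lines would be applied to `bias_data` in place of `toy_bias`.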

EXTENSION: Report the 95% confidence interval for the difference in proportions you just calculated. Hint: the infer package has a function that is useful here. What does the 95% confidence interval mean?

prop_test(bias_data, rare_disease_prog ~ rare_disease_cond, order = c("die", "save"))
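For comparison, base R's `prop.test()` reports a confidence interval for a difference in two proportions directly from counts. The counts below are hypothetical placeholders, not the class data, so this is a sketch of the mechanics rather than a reproduction of the `infer` result.

```r
# Hypothetical counts (NOT the class data): respondents choosing Program A
# and total respondents in each framing condition, ordered die then save.
chose_a <- c(die = 15, save = 28)
n_total <- c(die = 40, save = 40)

# Two-sample test of equal proportions; conf.int holds the 95% confidence
# interval for the difference p_die - p_save
res <- prop.test(x = chose_a, n = n_total, conf.level = 0.95)
res$estimate   # sample proportions in each condition
res$conf.int   # interval for the difference (die minus save)
```

Roughly speaking, if the survey were repeated many times and an interval built this way each time, about 95% of those intervals would contain the true difference in proportions.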

Note that extensions to questions are not the same as data science questions. Complete this extension if you like, but, unlike the actual data science questions, it is not required for data science students.

Question 2

Now let’s move on to the Linda problem. As we read in Kahneman (2003), answers to this problem tend to exhibit a pattern called a “conjunction fallacy” whereby respondents overrate the probability that Linda is a bank teller and a feminist rather than just a bank teller. From probability theory, we know that the conjunction of two events A and B can’t be more probable than either of the events occurring by itself; that is, \(P(A) \ge P(A \wedge B)\) and \(P(B) \ge P(A \wedge B)\).
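The conjunction rule can be checked numerically. The sketch below simulates two binary traits for a hypothetical population; the probabilities 0.3 and 0.6 are invented for illustration and have nothing to do with the survey.

```r
set.seed(1)  # for reproducibility of the simulated draws

# Simulate two arbitrary binary traits for 1,000 hypothetical people.
# The probabilities 0.3 and 0.6 are made up purely for illustration.
n <- 1000
is_teller   <- runif(n) < 0.3
is_feminist <- runif(n) < 0.6

p_teller <- mean(is_teller)
p_both   <- mean(is_teller & is_feminist)

# The conjunction can never be more frequent than either single event
c(p_teller = p_teller, p_both = p_both)
p_both <= p_teller
```

Everyone counted in `p_both` is also counted in `p_teller`, so the inequality holds for any choice of probabilities, not only the ones used here.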

What proportion of the class answered this question correctly? Why do you think people tend to choose the wrong option?

The proportion of students who correctly answered that Linda is more likely to be a teller is 0.7058824 (about 71%).

Question 3

What attributes of the respondents do you think might affect how they answered the Linda problem and why? Using the data, see if your hypothesis is correct.

I hypothesized that women would be more likely than men to answer the Linda question incorrectly because of a positive association with the term ‘feminism.’ The data are consistent with this hypothesis: the proportion of women who answered correctly was 0.6571429, and the proportion of men who answered correctly was 0.75.
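A group comparison like this one can be computed by marking each response as correct or not and averaging within groups. The sketch below uses toy data, and the column names `gender` and `linda` are assumptions, since the variable names are not shown in this document.

```r
# Toy data for illustration only; column names `gender` and `linda` are
# assumed, and these responses are NOT the class data.
toy_linda <- data.frame(
  gender = c("woman", "woman", "woman", "woman", "man", "man", "man", "man"),
  linda  = c("teller", "teller and feminist", "teller", "teller and feminist",
             "teller", "teller", "teller", "teller and feminist")
)

# "teller" alone is the correct (more probable) answer
toy_linda$correct <- toy_linda$linda == "teller"

# Proportion correct within each gender group
prop_correct <- tapply(toy_linda$correct, toy_linda$gender, mean)
prop_correct
```

A gap between sample proportions does not by itself show that the groups differ; a test such as `prop.test()` on the group counts would indicate whether the gap is distinguishable from sampling noise.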

Question 4: Data Science Question

Now we will take a look at the taxi cab problem. This problem, originally posed by Tversky and Kahneman in 1977, is intended to demonstrate what they call a “base rate fallacy”. To refresh your memory, here is the text of the problem, as you saw it on the survey last week:

The most common answer to this problem is .8. This corresponds to the reliability of the witness, without regard for the base rate at which Blue cabs can be found relative to Green cabs. In other words, respondents tend to disregard the base rate when estimating the probability the cab was Blue.

What is the true probability the cab was Blue? Visualize the distribution of the guesses in the class using a histogram. What was the most common guess in the class?

The true answer is 41%. Here’s the histogram of our answers though:

cab_histo

The most common guess appears to have been 0.8.
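The 41% figure follows from Bayes' rule. The sketch below assumes the standard version of the problem, in which 15% of cabs are Blue and 85% are Green; the survey text is not reproduced in this document, so those base rates are an assumption, while the 0.8 witness reliability is the value stated above.

```r
# Assumed base rates from the standard version of the cab problem
p_blue  <- 0.15   # assumption: share of cabs that are Blue
p_green <- 0.85   # assumption: share of cabs that are Green
reliab  <- 0.80   # witness identifies the colour correctly 80% of the time

# P(witness says Blue) = P(says Blue | Blue) P(Blue) + P(says Blue | Green) P(Green)
p_says_blue <- reliab * p_blue + (1 - reliab) * p_green

# Bayes' rule: P(Blue | witness says Blue)
p_blue_given_says_blue <- reliab * p_blue / p_says_blue
round(p_blue_given_says_blue, 3)   # 0.414
```

The answer falls well below 0.8 because Green cabs are so much more common that the witness's occasional errors on Green cabs outnumber the correct identifications of Blue cabs.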

Part 2: Political Faces

Now you will investigate some of the data used in Atkinson et al. (2009). These data cover Senate candidates from 1992-2006 and include face ratings, partisanship, incumbent status, and other variables.

Data Details:

Variable Name Variable Description
The assessment of the Senate race from the Cook Political Report in the year prior to the election
The year of the election
The state in which the candidate was running
The normalized rating of the candidate’s perceived competence based on an image of the face
An indicator variable for whether the candidate was an incumbent
The candidate’s name
The candidate’s political party
An indicator variable for whether the race was one of two “tossup” categories according to Cook
A unique identifier for the photo of the candidate

As before, make sure you replace “file_name_here_2.csv” with the name of the file.

face_data <- read_csv("senate_data.csv")

As an example of how you might write your own code to analyze these data, let’s take a look at whether there was a difference in the perceived competence of Democratic and Republican candidates’ faces. We can examine this question graphically using a density plot.

# make density plot of perceived competence by party
# (setting color = party displays each party's face ratings in a different color)
ggplot(data = face_data, aes(x = face_rating, color = party)) +
  geom_density()

We can also consider this statistically using a t-test for whether or not the mean face ratings are significantly different across parties.

# conduct a t-test of difference-in-means
difference_in_means(face_rating ~ party, data = face_data)
## Design:  Standard 
##           Estimate Std. Error  t value  Pr(>|t|)    CI Lower  CI Upper       DF
## partyrep 0.1044044 0.09565385 1.091482 0.2756698 -0.08360089 0.2924098 431.5741

Neither the graphical nor the statistical approaches suggest a significant difference in perceived competence of candidate faces by party.

Question 5

Do the data suggest a significant difference between perceived competence of incumbent vs. non-incumbent candidate faces? How do your findings relate to the results and theory of Atkinson et al. (2009)?
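One possible starting point, mirroring the party comparison above, is a two-sample t-test of face ratings by incumbency status. The sketch below uses base R's `t.test()` on toy data; the column name `incumbent` and its 0/1 coding are assumptions, since the variable name is not shown in this document.

```r
# Toy data for illustration only; `incumbent` is an assumed column name
# (a 0/1 indicator), and these ratings are made up.
toy_faces <- data.frame(
  face_rating = c(0.4, 0.9, 0.1, 0.6, -0.3, 0.2, -0.5, 0.0),
  incumbent   = c(1, 1, 1, 1, 0, 0, 0, 0)
)

# Welch two-sample t-test of mean face rating by incumbency status
tt <- t.test(face_rating ~ incumbent, data = toy_faces)
tt$estimate   # group means (non-incumbents first, then incumbents)
tt$conf.int   # 95% CI for mean(incumbent = 0) minus mean(incumbent = 1)
tt$p.value
```

The same comparison could also be written with `difference_in_means()` as in the party example, substituting the actual name of the incumbency indicator in `face_data`.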

Question 6

Do the data suggest a significant difference between perceived competence of non-incumbent candidate faces in tossup vs. non-tossup races? What might explain any similarities or differences between these results and those from the previous question? How do your findings relate to the results and theory of Atkinson et al. (2009)?

Question 7: Data Science Question

Atkinson et al. (2009, 236) suggest that “…incumbents from the most competitive districts would have higher facial quality than incumbents from the most safe incumbent districts due to the selection process of better faces to competitive districts, inducing a negative relationship between incumbent face and incumbent vote.” Do the data support the idea that seat safety is negatively correlated with incumbent facial quality? Make a plot to visualize this relationship. Note that this question may require you to define at least one new variable.

## Warning: Ignoring unknown aesthetics: candidate_year
## 
## Call:
## lm(formula = face_data_supp$face_rating ~ face_data_supp$cook_quant)
## 
## Coefficients:
##               (Intercept)  face_data_supp$cook_quant  
##                   -0.1382                     0.0757
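A fit like the one above requires a numeric measure of seat safety. The sketch below shows one way such a variable might be built from a categorical Cook rating and then used in `lm()`. The data are made up, and the rating labels, the column name `cook`, and the ordering of the categories are all assumptions; only the name `cook_quant` is taken from the output above.

```r
# Toy data for illustration only. The Cook rating labels and the column
# name `cook` are assumptions; `cook_quant` mirrors the name in the output above.
toy_incumbents <- data.frame(
  face_rating = c(0.8, 0.5, 0.3, 0.1, -0.2, -0.4),
  cook        = c("Toss Up", "Toss Up", "Leans", "Leans", "Solid", "Solid")
)

# Code seat safety numerically: higher values = safer seat (assumed ordering)
safety_levels <- c("Toss Up" = 1, "Leans" = 2, "Solid" = 3)
toy_incumbents$cook_quant <- unname(safety_levels[toy_incumbents$cook])

# Linear fit of face rating on seat safety; a negative slope would match
# the relationship Atkinson et al. describe
fit <- lm(face_rating ~ cook_quant, data = toy_incumbents)
coef(fit)
```

Passing `data =` to `lm()` rather than writing `face_data_supp$...` inside the formula gives cleaner coefficient labels. Calling `plot(face_rating ~ cook_quant, data = toy_incumbents)` followed by `abline(fit)` would draw the scatterplot with the fitted line.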

Question 8

Is there something else interesting or informative that you could explore using either of these datasets? If so, run it by a TF and try it out here.